OCR exploration server


Effects of OCR errors on ranking and feedback using the vector space model

Internal identifier: 002682 (Main/Exploration); previous: 002681; next: 002683

Authors: Kazem Taghva [United States]; Julie Borsack [United States]; Allen Condit [United States]

Source: Information Processing and Management (ISSN 0306-4573), Elsevier, 1996, vol. 32, no. 3, pp. 317-327

RBID : ISTEX:2022C26E3682F8C2CDD3580811393DEEE55E8CA8

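English descriptors: Character recognition; Document retrieval; Document retrieval system; Error; Full text; Influence; Optical reading; Test; Vector space model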

Abstract

We report on the performance of the vector space model in the presence of OCR errors. We show that average precision and recall are not affected for our full-text document collection when the OCR version is compared to its corresponding corrected set. We do, however, see divergence between the relevant document rankings of the OCR and corrected collections under different weighting combinations. In particular, we observed that cosine normalization plays a considerable role in the disparity seen between the collections. Furthermore, we show that even though feedback improves retrieval for both collections, it cannot be used to compensate for OCR errors caused by badly degraded documents.
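
As a toy illustration of how cosine normalization can separate the scores of an OCR text and its corrected version, the sketch below scores one query against two variants of the same short document. It is not the authors' experimental setup. The assumptions are ours: raw term-frequency weights, whitespace tokenization, OCR noise modelled as a few extra junk tokens, and hypothetical helper names (`tf_vector`, `dot`, `cosine`).

```python
import math
from collections import Counter


def tf_vector(text):
    """Raw term-frequency vector for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def dot(u, v):
    """Inner product of two sparse term-weight vectors."""
    return sum(weight * v[term] for term, weight in u.items() if term in v)


def cosine(u, v):
    """Inner product divided by the product of the two vector lengths."""
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


query = tf_vector("vector space retrieval")
corrected = tf_vector("vector space model for document retrieval")
# Assumption: OCR noise adds junk tokens that match no query term.
ocr = tf_vector("vector space model for document retrieval tbe iii l1l")

# Without normalization both versions match the same three query terms.
print(dot(query, corrected), dot(query, ocr))

# With cosine normalization the junk tokens lengthen the OCR vector,
# so the OCR version scores lower than the corrected one.
print(round(cosine(query, corrected), 4), round(cosine(query, ocr), 4))
```

In this sketch the unnormalized scores are identical, while the cosine score of the noisy version drops because its vector is longer. Whether this is the mechanism behind the disparity reported in the abstract is not stated in this record.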

Url: https://api.istex.fr/document/2022C26E3682F8C2CDD3580811393DEEE55E8CA8/fulltext/pdf
DOI: 10.1016/0306-4573(95)00058-5


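Affiliations: Information Science Research Institute, University of Nevada, Las Vegas, Nev., United States (all three authors)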



The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Effects of OCR errors on ranking and feedback using the vector space model</title>
<author>
<name sortKey="Taghva, Kazem" sort="Taghva, Kazem" uniqKey="Taghva K" first="Kazem" last="Taghva">Kazem Taghva</name>
</author>
<author>
<name sortKey="Borsack, Julie" sort="Borsack, Julie" uniqKey="Borsack J" first="Julie" last="Borsack">Julie Borsack</name>
</author>
<author>
<name sortKey="Condit, Allen" sort="Condit, Allen" uniqKey="Condit A" first="Allen" last="Condit">Allen Condit</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:2022C26E3682F8C2CDD3580811393DEEE55E8CA8</idno>
<date when="1996" year="1996">1996</date>
<idno type="doi">10.1016/0306-4573(95)00058-5</idno>
<idno type="url">https://api.istex.fr/document/2022C26E3682F8C2CDD3580811393DEEE55E8CA8/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000017</idno>
<idno type="wicri:Area/Istex/Curation">000017</idno>
<idno type="wicri:Area/Istex/Checkpoint">001A93</idno>
<idno type="wicri:doubleKey">0306-4573:1996:Taghva K:effects:of:ocr</idno>
<idno type="wicri:Area/Main/Merge">002826</idno>
<idno type="wicri:source">INIST</idno>
<idno type="RBID">Pascal:96-0295002</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000A07</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000991</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000955</idno>
<idno type="wicri:doubleKey">0306-4573:1996:Taghva K:effects:of:ocr</idno>
<idno type="wicri:Area/Main/Merge">002A42</idno>
<idno type="wicri:Area/Main/Curation">002682</idno>
<idno type="wicri:Area/Main/Exploration">002682</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a">Effects of OCR errors on ranking and feedback using the vector space model</title>
<author>
<name sortKey="Taghva, Kazem" sort="Taghva, Kazem" uniqKey="Taghva K" first="Kazem" last="Taghva">Kazem Taghva</name>
<affiliation wicri:level="1">
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Information Science Research Institute, University of Nevada, Las Vegas, Nev.</wicri:regionArea>
<wicri:noRegion>Nev.</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Borsack, Julie" sort="Borsack, Julie" uniqKey="Borsack J" first="Julie" last="Borsack">Julie Borsack</name>
<affiliation wicri:level="1">
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Information Science Research Institute, University of Nevada, Las Vegas, Nev.</wicri:regionArea>
<wicri:noRegion>Nev.</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Condit, Allen" sort="Condit, Allen" uniqKey="Condit A" first="Allen" last="Condit">Allen Condit</name>
<affiliation wicri:level="1">
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Information Science Research Institute, University of Nevada, Las Vegas, Nev.</wicri:regionArea>
<wicri:noRegion>Nev.</wicri:noRegion>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">Information Processing and Management</title>
<title level="j" type="abbrev">IPM</title>
<idno type="ISSN">0306-4573</idno>
<imprint>
<publisher>ELSEVIER</publisher>
<date type="published" when="1996">1996</date>
<biblScope unit="volume">32</biblScope>
<biblScope unit="issue">3</biblScope>
<biblScope unit="page" from="317">317</biblScope>
<biblScope unit="page" to="327">327</biblScope>
</imprint>
<idno type="ISSN">0306-4573</idno>
</series>
<idno type="istex">2022C26E3682F8C2CDD3580811393DEEE55E8CA8</idno>
<idno type="DOI">10.1016/0306-4573(95)00058-5</idno>
<idno type="PII">0306-4573(95)00058-5</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0306-4573</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Character recognition</term>
<term>Document retrieval</term>
<term>Document retrieval system</term>
<term>Error</term>
<term>Full text</term>
<term>Influence</term>
<term>Optical reading</term>
<term>Test</term>
<term>Vector space model</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Erreur</term>
<term>Essai</term>
<term>Influence</term>
<term>Lecture optique</term>
<term>Modèle espace vectoriel</term>
<term>Recherche documentaire</term>
<term>Reconnaissance caractère</term>
<term>Système documentaire</term>
<term>Texte intégral</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Essai</term>
<term>Recherche documentaire</term>
<term>Système documentaire</term>
</keywords>
</textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">We report on the performance of the vector space model in the presence of OCR errors. We show that average precision and recall are not affected for our full-text document collection when the OCR version is compared to its corresponding corrected set. We do, however, see divergence between the relevant document rankings of the OCR and corrected collections under different weighting combinations. In particular, we observed that cosine normalization plays a considerable role in the disparity seen between the collections. Furthermore, we show that even though feedback improves retrieval for both collections, it cannot be used to compensate for OCR errors caused by badly degraded documents.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>États-Unis</li>
</country>
</list>
<tree>
<country name="États-Unis">
<noRegion>
<name sortKey="Taghva, Kazem" sort="Taghva, Kazem" uniqKey="Taghva K" first="Kazem" last="Taghva">Kazem Taghva</name>
</noRegion>
<name sortKey="Borsack, Julie" sort="Borsack, Julie" uniqKey="Borsack J" first="Julie" last="Borsack">Julie Borsack</name>
<name sortKey="Condit, Allen" sort="Condit, Allen" uniqKey="Condit A" first="Allen" last="Condit">Allen Condit</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002682 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 002682 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:2022C26E3682F8C2CDD3580811393DEEE55E8CA8
   |texte=   Effects of OCR errors on ranking and feedback using the vector space model
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024